In this project, we analyze a dataset of National Basketball Association (NBA) player statistics sourced from NBA.com and basketball-reference.com (see links at the bottom of the page). Our aim is to derive a model that can effectively predict a player’s net +/- rating, a metric indicative of a player’s impact on the court.
The dataset contains various details of NBA players’ performances, such as their points scored (PTS), field goal percentage (FG.%), three-point percentage (X3P.%), free throw attempts (FTA), offensive rebounds (OREB), defensive rebounds (DREB), assists (AST), turnovers (TOV), steals (STL), blocks (BLK), personal fouls (PF), and player position.
In basketball, players need to excel in several areas to contribute positively to their team’s performance. Understanding the combination of statistics that best predicts a player’s impact can provide valuable insights for team management in player evaluation, strategic decision-making, and overall team composition.
We will attempt to create a model that can predict a player’s net rating based on their individual performance metrics. We will use multiple linear regression modeling techniques and check LINE assumptions. We will address potential challenges such as multicollinearity and heteroscedasticity, if present. We will also evaluate our model’s performance using metrics such as the Adjusted R-squared value and conduct diagnostics to identify potential outliers and influential points.
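As a minimal sketch of the core technique before we touch the real data (toy columns and effect sizes are made up, not from the dataset), fitting a multiple linear regression in R and reading off the adjusted R-squared looks like this:

```r
# Toy data standing in for player statistics (hypothetical columns and effects)
set.seed(1)
toy <- data.frame(PTS = rnorm(100, mean = 500, sd = 150),
                  STL = rnorm(100, mean = 30, sd = 10))
toy$net_rating <- 0.1 * toy$PTS + 2 * toy$STL + rnorm(100, sd = 20)

fit <- lm(net_rating ~ PTS + STL, data = toy)  # multiple linear regression
summary(fit)$adj.r.squared                     # the evaluation metric we will use
```

The real analysis below follows the same pattern, just with more predictors and diagnostics.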
player_data = read.csv("nbaplayerstats.csv")
team_data = read.csv("nbateamstats.csv")
position_data = read.csv("nba_position_data.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
position_data <- position_data %>%
group_by(First_Name, Last_Name) %>%
summarise(Position = names(which.max(table(Position))))
## `summarise()` has grouped output by 'First_Name'. You can override using the
## `.groups` argument.
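The `summarise()` call keeps each player's most frequently listed position via `which.max(table(...))`; a toy illustration with a hypothetical player listed at two positions across seasons:

```r
# A player listed as PF twice and C once across seasons
positions <- c("PF", "PF", "C")
# table() counts each label, which.max() finds the (first) largest count,
# and names() recovers the position label itself
names(which.max(table(positions)))  # -> "PF"
```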
player_data <- merge(player_data, position_data, by = c("First_Name", "Last_Name"), all.x = TRUE)
head(player_data)
## First_Name Last_Name TEAM AGE GP W L MIN PTS FGM FGA FG. X3PM X3PA
## 1 A.J. Lawson DAL 22 15 5 10 108.3 56 22 44 50.0 10 25
## 2 Aaron Gordon DEN 25 50 29 21 1383.8 618 231 499 46.3 59 176
## 3 Aaron Gordon DEN 27 68 45 23 2055.1 1109 429 761 56.4 60 173
## 4 Aaron Gordon DEN 26 75 46 29 2375.4 1126 434 834 52.0 87 260
## 5 Aaron Gordon ORL 24 62 30 32 2017.1 894 335 767 43.7 73 237
## 6 Aaron Henry PHI 22 6 6 0 17.0 2 1 5 20.0 0 1
## X3P. FTM FTA FT. OREB DREB REB AST TOV STL BLK PF FP DD2 TD3 rating Year
## 1 40.0 2 8 25.0 6 15 21 2 3 2 0 11 87 0 0 -46 2023
## 2 33.5 97 149 65.1 77 207 284 161 97 33 34 89 1304 3 1 60 2021
## 3 34.7 191 314 60.8 164 282 446 203 98 54 51 129 2166 11 0 518 2023
## 4 33.5 171 230 74.3 125 314 439 188 133 44 44 148 2066 6 0 321 2022
## 5 30.8 151 224 67.4 107 368 475 228 100 51 39 125 1976 20 1 -68 2020
## 6 0.0 0 0 0.0 0 1 1 0 2 0 2 2 7 0 0 -20 2022
## Position
## 1 SG
## 2 PF
## 3 PF
## 4 PF
## 5 PF
## 6 SF
summary(player_data)
## First_Name Last_Name TEAM AGE
## Length:2213 Length:2213 Length:2213 Min. :19.00
## Class :character Class :character Class :character 1st Qu.:23.00
## Mode :character Mode :character Mode :character Median :25.00
## Mean :26.08
## 3rd Qu.:29.00
## Max. :43.00
## GP W L MIN PTS
## Min. : 1 Min. : 0.00 Min. : 0.00 Min. : 0.6 Min. : 0.0
## 1st Qu.:24 1st Qu.:10.00 1st Qu.:11.00 1st Qu.: 267.3 1st Qu.: 92.0
## Median :49 Median :21.00 Median :22.00 Median : 924.2 Median : 342.0
## Mean :44 Mean :22.08 Mean :21.92 Mean :1004.1 Mean : 466.9
## 3rd Qu.:64 3rd Qu.:34.00 3rd Qu.:31.00 3rd Qu.:1653.8 3rd Qu.: 704.0
## Max. :83 Max. :64.00 Max. :60.00 Max. :2963.2 Max. :2335.0
## FGM FGA FG. X3PM
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 35.0 1st Qu.: 79.0 1st Qu.: 40.20 1st Qu.: 4.00
## Median :126.0 Median : 272.0 Median : 44.50 Median : 31.00
## Mean :171.2 Mean : 367.4 Mean : 44.63 Mean : 51.62
## 3rd Qu.:261.0 3rd Qu.: 564.0 3rd Qu.: 49.80 3rd Qu.: 82.00
## Max. :774.0 Max. :1564.0 Max. :100.00 Max. :337.00
## X3PA X3P. FTM FTA
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 15.0 1st Qu.: 25.40 1st Qu.: 11.00 1st Qu.: 15.00
## Median : 92.0 Median : 33.30 Median : 41.00 Median : 56.00
## Mean :143.6 Mean : 29.86 Mean : 72.92 Mean : 93.87
## 3rd Qu.:234.0 3rd Qu.: 37.90 3rd Qu.: 98.00 3rd Qu.:129.00
## Max. :843.0 Max. :100.00 Max. :692.00 Max. :803.00
## FT. OREB DREB REB
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 65.90 1st Qu.: 9.00 1st Qu.: 34.0 1st Qu.: 46.0
## Median : 76.10 Median : 27.00 Median :112.0 Median : 143.0
## Mean : 70.69 Mean : 42.32 Mean :141.5 Mean : 183.8
## 3rd Qu.: 83.70 3rd Qu.: 58.00 3rd Qu.:210.0 3rd Qu.: 271.0
## Max. :100.00 Max. :349.00 Max. :813.0 Max. :1019.0
## AST TOV STL BLK
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 17.0 1st Qu.: 12.00 1st Qu.: 7.0 1st Qu.: 4.00
## Median : 63.0 Median : 41.00 Median : 26.0 Median : 12.00
## Mean :103.1 Mean : 55.67 Mean : 31.3 Mean : 19.86
## 3rd Qu.:137.0 3rd Qu.: 80.00 3rd Qu.: 48.0 3rd Qu.: 26.00
## Max. :763.0 Max. :312.00 Max. :138.0 Max. :196.00
## PF FP DD2 TD3
## Min. : 0.00 Min. : -2.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 26.00 1st Qu.: 215.0 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 78.00 Median : 762.0 Median : 0.000 Median : 0.0000
## Mean : 82.75 Mean : 939.9 Mean : 3.646 Mean : 0.2219
## 3rd Qu.:126.00 3rd Qu.:1449.0 3rd Qu.: 3.000 3rd Qu.: 0.0000
## Max. :286.00 Max. :4338.0 Max. :66.000 Max. :38.0000
## rating Year Position
## Min. :-642 Min. :2020 Length:2213
## 1st Qu.: -66 1st Qu.:2021 Class :character
## Median : -8 Median :2022 Mode :character
## Mean : 0 Mean :2022
## 3rd Qu.: 51 3rd Qu.:2022
## Max. : 728 Max. :2023
na_positions <- sum(is.na(player_data$Position))
print(paste("Number of players without position: ", na_positions))
## [1] "Number of players without position: 0"
# Subset the data for players without position
#missing_positions <- player_data[is.na(player_data$Position),]
#print(missing_positions)
Now we have the data combined with all of our position data.
colnames(player_data)
## [1] "First_Name" "Last_Name" "TEAM" "AGE" "GP"
## [6] "W" "L" "MIN" "PTS" "FGM"
## [11] "FGA" "FG." "X3PM" "X3PA" "X3P."
## [16] "FTM" "FTA" "FT." "OREB" "DREB"
## [21] "REB" "AST" "TOV" "STL" "BLK"
## [26] "PF" "FP" "DD2" "TD3" "rating"
## [31] "Year" "Position"
colnames(team_data)
## [1] "City" "Team" "GP" "W" "L" "WIN." "MIN" "PTS"
## [9] "FGM" "FGA" "FG." "X3PM" "X3PA" "X3P." "FTM" "FTA"
## [17] "FT." "OREB" "DREB" "REB" "AST" "TOV" "STL" "BLK"
## [25] "BLKA" "PF" "PFD" "rating" "Year"
Here we check the column names of our player data and team data so we know what parameters we are dealing with.
team_name_mapping <- c(
"ATL" = "Hawks",
"BKN" = "Nets",
"BOS" = "Celtics",
"CHA" = "Hornets",
"CHI" = "Bulls",
"CLE" = "Cavaliers",
"DAL" = "Mavericks",
"DEN" = "Nuggets",
"DET" = "Pistons",
"GSW" = "Warriors",
"HOU" = "Rockets",
"IND" = "Pacers",
"LAC" = "Clippers",
"LAL" = "Lakers",
"MEM" = "Grizzlies",
"MIA" = "Heat",
"MIL" = "Bucks",
"MIN" = "Timberwolves",
"NOP" = "Pelicans",
"NYK" = "Knicks",
"OKC" = "Thunder",
"ORL" = "Magic",
"PHI" = "76ers",
"PHX" = "Suns",
"POR" = "Trail Blazers",
"SAC" = "Kings",
"SAS" = "Spurs",
"TOR" = "Raptors",
"UTA" = "Jazz",
"WAS" = "Wizards"
)
player_data$TEAM <- team_name_mapping[player_data$TEAM]
We modify the names of the teams in the player data to match the names of the teams in the team data.
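The recode works because a named character vector in R doubles as a lookup table; indexing it by the abbreviations returns the mapped full names. A minimal sketch with two entries from the mapping above:

```r
# Named vector as lookup table (two entries from the full mapping)
team_name_mapping <- c("DEN" = "Nuggets", "ORL" = "Magic")
abbrevs <- c("ORL", "DEN", "DEN")        # e.g. a TEAM column
unname(team_name_mapping[abbrevs])       # -> "Magic" "Nuggets" "Nuggets"
```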
library(dplyr)
player_data <- player_data %>%
mutate(per_game_rating = rating / GP)
team_data_small <- team_data %>%
select(Team, Year, rating)
names(team_data_small) <- c("TEAM", "Year", "team_rating")
player_data <- left_join(player_data,
team_data_small,
by = c("TEAM" = "TEAM", "Year" = "Year"))
player_data <- player_data %>%
mutate(player_net_rating = (per_game_rating - team_rating) * GP)
We create a new column called player_net_rating, which calculates each player’s +/- rating relative to his team’s net +/- rating. This ensures that players on better teams don’t get an advantage over players on worse teams.
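With illustrative numbers (not taken from the dataset), the adjustment works like this; note that algebraically (rating / GP - team_rating) * GP reduces to rating - GP * team_rating:

```r
# Hypothetical player: +60 total rating over 50 games, team_rating of 1.2
rating <- 60; GP <- 50; team_rating <- 1.2
per_game_rating <- rating / GP                            # 1.2 per game
player_net_rating <- (per_game_rating - team_rating) * GP
player_net_rating                                         # 0: exactly average for this team
```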
# Subset data based on total minutes played condition
player_data <- player_data[player_data$MIN >= 100, ]
This removes players who played fewer than 100 minutes, as these players are often statistical outliers: they have not played enough minutes for the law of large numbers to kick in.
library(ggplot2)
variables <- c("AGE", "PTS", "FG.", "X3P.", "FT.", "OREB", "DREB", "AST", "TOV", "STL", "BLK", "PF", "Position")
for (var in variables) {
  # .data[[var]] is the tidy-evaluation replacement for the deprecated aes_string()
  p = ggplot(player_data, aes(x = .data[[var]], y = rating)) +
    geom_point() +
    labs(x = var, y = "Player Net Rating") +
    theme_minimal() +
    ggtitle(paste("Scatterplot of", var, "vs Player Net Rating"))
  print(p)
}
These scatterplots show the relationship between our parameters and the player net ratings. This will help us determine some of the transformations we should try on our parameters.
full_model = lm(player_net_rating ~ AGE+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+OREB+DREB+AST+STL+BLK+as.factor(Position)+TOV+PF, data=player_data)
new_model = lm(I(player_net_rating^(1/3)) ~ I(1/(AGE^2))+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+DREB+AST+OREB+STL+BLK, data=player_data)
model = new_model
#model = lm(player_net_rating ~ .-First_Name-Last_Name-TEAM-Year-rating-per_game_rating-team_rating-L-FTM-REB-W-FP-DD2-TD3-MIN-FGA-FGM-X3PA-X3PM-FTA-GP, data = player_data)
#summary(model)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model)
## I(1/(AGE^2)) PTS I(FG.^2) I(X3P.^2) I(FT.^2) DREB
## 1.083603 4.699377 1.808656 1.103996 1.426448 5.894531
## AST OREB STL BLK
## 3.584875 4.371550 2.755248 2.544944
We started with a mostly full model that removed the columns that weren’t helpful to our goal of predicting player net rating, using player_net_rating ~ . - First_Name - Last_Name - TEAM - Year - rating - per_game_rating - team_rating. We also removed categories such as wins, minutes, and games played, which were not helpful from a prediction standpoint. Running vif() then revealed several parameters that overlapped with one another, causing multicollinearity issues; for example, total rebounds is simply offensive and defensive rebounds combined.
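For intuition, vif() implements VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors; a hand-rolled sketch on toy data (the variables here are hypothetical):

```r
# Two deliberately correlated toy predictors
set.seed(7)
d <- data.frame(a = rnorm(100))
d$b <- d$a + rnorm(100, sd = 0.5)   # b largely duplicates a, like REB vs OREB + DREB

r2 <- summary(lm(b ~ a, data = d))$r.squared  # how well the other predictors explain b
1 / (1 - r2)                                  # the VIF for b; values well above 1 flag collinearity
```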
We didn’t want team to impact our results; the last two removed columns were used for creating the response, and the others were essentially row identifiers. That left us with the full model above. After looking at the scatterplots, we decided to remove turnovers and personal fouls: players with higher turnovers and personal fouls actually had higher net ratings, because these are the players who play a lot and so have more opportunity to make mistakes (or, in some cases, commit valuable fouls). The players who play a lot also tend to be the best players, so we left those two variables out since the relationship was deceptive. The position category did have some value, but we removed it here because we were more interested in what statistical categories a player can contribute to than in what position they happen to be marked as by their team. We can also ultimately recover the position information indirectly: if, for example, we discover that rebounds are highly predictive of net rating, then we would know that centers provide outsized value, because they are often the bulk rebounders.
The other changes we made were to square the three shooting percentages, because the scatterplots show a nonlinear relationship with net rating. We also transformed age: the youngest players improved the most as they got older, values then reached a near plateau, and the oldest players started to lose value as they aged (at a slower rate than the younger players gained, so we also considered a log relationship). These older players then tended to retire before their value presumably would have dropped off even more. The last thing we did was transform the response by taking the cube root of the player net rating. We did this because our Q-Q plot and our fitted versus residuals plot were wildly off and violated our assumptions.
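One caveat worth flagging with the cube-root transform: in R, `^` returns NaN for a negative base with a fractional exponent, so negative net ratings become missing and are silently dropped by lm() (plausibly why the model summaries later report observations deleted due to missingness). A signed cube root, sketched here as a hypothetical helper not used in the original analysis, would keep those rows:

```r
(-8)^(1/3)   # NaN: `^` is undefined in R for negative bases with fractional exponents

# Hypothetical helper: apply the cube root to the magnitude, then restore the sign
signed_cbrt <- function(x) sign(x) * abs(x)^(1/3)
signed_cbrt(-8)   # -2
signed_cbrt(27)   #  3
```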
library(ggplot2)
residuals = resid(model)
predicted = predict(model)
#show the fitted vs redisuals
plot(fitted(model), resid(model), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, col = "darkorange", lwd = 2)
#show the Q-Q plot
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey")
qqline(resid(model), col = "dodgerblue", lwd = 2)
cooksd = cooks.distance(model)
plot(cooksd, pch="*", cex=2, main="Influence of Observations (Cook's distance)")
abline(h = 4*mean(cooksd, na.rm=T), col="red")
sum(cooks.distance(model) > 4 / length(cooks.distance(model)))
## [1] 39
Here we test the assumptions of our model, and we see some fairly good results. For linearity, we used our scatterplots to check which parameters did not have linear relationships and transformed the ones that didn’t. Our fitted versus residuals plot looks mostly evenly scattered, with maybe a handful of outliers, and the vertical spread of the points is roughly constant across the fitted values; both are much better than our previous attempts. So we can feel good about homoscedasticity and independence now. In the Normal Q-Q plot, our data adheres very closely to the line, with very little tailing, so we can be fairly confident normality is not violated.
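If we wanted formal backing for these visual checks, base R provides a Shapiro-Wilk test of residual normality, and a Breusch-Pagan-style statistic can be computed by hand; this is a sketch on simulated data (our fitted model object would slot in the same way):

```r
# Simulated well-behaved regression standing in for our fitted model
set.seed(42)
sim <- data.frame(x = rnorm(200))
sim$y <- 2 * sim$x + rnorm(200)
fit <- lm(y ~ x, data = sim)

# Shapiro-Wilk: a small p-value would suggest non-normal residuals
shapiro.test(resid(fit))$p.value

# Breusch-Pagan-style check: regress squared residuals on fitted values;
# n * R^2 is approximately chi-squared(1) under constant error variance
aux <- lm(resid(fit)^2 ~ fitted(fit))
bp <- nrow(sim) * summary(aux)$r.squared
pchisq(bp, df = 1, lower.tail = FALSE)  # a small p-value would suggest heteroscedasticity
```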
Let’s see if we can find a good model that uses fewer of our parameters:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
null_model <- lm(player_net_rating ~ 1, data = player_data)
scope <- list(lower = null_model, upper = model)
backward_aic <- step(model, direction = "backward", trace = FALSE)
forward_aic <- step(null_model, scope = scope, direction = "forward", trace = FALSE)
both_aic <- step(model, scope = scope, direction = "both", trace = FALSE)
backward_bic <- step(model, k = log(nrow(player_data)), direction = "backward", trace = FALSE)
forward_bic <- step(model, scope = scope, k = log(nrow(player_data)), direction = "forward", trace = FALSE)
both_bic <- step(model, scope = scope, k = log(nrow(player_data)), direction = "both", trace = 0)
print(backward_aic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(X3P.^2) + OREB +
## STL, data = player_data)
##
## Coefficients:
## (Intercept) I(X3P.^2) OREB STL
## 4.1898332 0.0001207 0.0018428 0.0042514
print(forward_aic)
##
## Call:
## lm(formula = player_net_rating ~ AST + I(1/(AGE^2)) + BLK, data = player_data)
##
## Coefficients:
## (Intercept) AST I(1/(AGE^2)) BLK
## -109.7533 0.2334 42830.8229 0.3568
print(both_aic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(X3P.^2) + OREB +
## STL, data = player_data)
##
## Coefficients:
## (Intercept) I(X3P.^2) OREB STL
## 4.1898332 0.0001207 0.0018428 0.0042514
print(backward_bic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ STL, data = player_data)
##
## Coefficients:
## (Intercept) STL
## 4.365881 0.005778
print(forward_bic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(1/(AGE^2)) + PTS +
## I(FG.^2) + I(X3P.^2) + I(FT.^2) + DREB + AST + OREB + STL +
## BLK, data = player_data)
##
## Coefficients:
## (Intercept) I(1/(AGE^2)) PTS I(FG.^2) I(X3P.^2)
## 3.991e+00 8.841e+01 -8.087e-05 -9.793e-05 1.170e-04
## I(FT.^2) DREB AST OREB STL
## 4.422e-05 -4.120e-05 -2.098e-04 3.135e-03 4.703e-03
## BLK
## 7.677e-04
print(both_bic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ STL, data = player_data)
##
## Coefficients:
## (Intercept) STL
## 4.365881 0.005778
We can see that many of the selected models include steals and offensive rebounds, and three-point percentage also appears a couple of times. We will now try including an interaction between steals and offensive rebounds.
new_model = lm(I(player_net_rating^(1/3)) ~ log(AGE)+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+DREB+AST+OREB*STL+BLK, data=player_data)
model = new_model
library(MASS)
null_model <- lm(player_net_rating ~ 1, data = player_data)
scope <- list(lower = null_model, upper = model)
backward_aic <- step(model, direction = "backward", trace = FALSE)
forward_aic <- step(null_model, scope = scope, direction = "forward", trace = FALSE)
both_aic <- step(model, scope = scope, direction = "both", trace = FALSE)
backward_bic <- step(model, k = log(nrow(player_data)), direction = "backward", trace = FALSE)
forward_bic <- step(model, scope = scope, k = log(nrow(player_data)), direction = "forward", trace = FALSE)
both_bic <- step(model, scope = scope, k = log(nrow(player_data)), direction = "both", trace = 0)
print(backward_aic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(FG.^2) + OREB + STL +
## OREB:STL, data = player_data)
##
## Coefficients:
## (Intercept) I(FG.^2) OREB STL OREB:STL
## 4.3486634 -0.0001456 0.0090851 0.0107939 -0.0001341
print(forward_aic)
##
## Call:
## lm(formula = player_net_rating ~ AST + log(AGE) + BLK, data = player_data)
##
## Coefficients:
## (Intercept) AST log(AGE) BLK
## 373.8390 0.2343 -128.0563 0.3555
print(both_aic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(FG.^2) + OREB + STL +
## OREB:STL, data = player_data)
##
## Coefficients:
## (Intercept) I(FG.^2) OREB STL OREB:STL
## 4.3486634 -0.0001456 0.0090851 0.0107939 -0.0001341
print(backward_bic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ OREB + STL + OREB:STL,
## data = player_data)
##
## Coefficients:
## (Intercept) OREB STL OREB:STL
## 4.0871716 0.0071683 0.0111633 -0.0001242
print(forward_bic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ log(AGE) + PTS + I(FG.^2) +
## I(X3P.^2) + I(FT.^2) + DREB + AST + OREB * STL + BLK, data = player_data)
##
## Coefficients:
## (Intercept) log(AGE) PTS I(FG.^2) I(X3P.^2) I(FT.^2)
## 4.597e+00 -1.820e-01 -1.300e-04 -1.276e-04 9.358e-05 3.982e-05
## DREB AST OREB STL BLK OREB:STL
## 1.099e-04 -1.506e-04 9.333e-03 1.145e-02 -3.680e-04 -1.287e-04
print(both_bic)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ OREB + STL + OREB:STL,
## data = player_data)
##
## Coefficients:
## (Intercept) OREB STL OREB:STL
## 4.0871716 0.0071683 0.0111633 -0.0001242
We ran these one more time, this time including an interaction between steals and offensive rebounds, and we got several models that include just those terms. We tried adding and subtracting the various shot-percentage stats, but none made a significant enough impact to include them. Leaving those out, we now check this smaller model against the new model from above.
new_model = lm(I(player_net_rating^(1/3)) ~ log(AGE)+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+DREB+AST+OREB*STL+BLK, data=player_data)
model4= lm(formula = I(player_net_rating^(1/3)) ~ OREB*STL, data = player_data)
summary(model4)
##
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ OREB * STL, data = player_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4551 -1.0344 0.0783 0.9905 3.9749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.087e+00 1.170e-01 34.928 < 2e-16 ***
## OREB 7.168e-03 2.014e-03 3.559 0.000392 ***
## STL 1.116e-02 2.796e-03 3.993 7.05e-05 ***
## OREB:STL -1.242e-04 3.973e-05 -3.127 0.001823 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.45 on 903 degrees of freedom
## (974 observations deleted due to missingness)
## Multiple R-squared: 0.02494, Adjusted R-squared: 0.02171
## F-statistic: 7.7 on 3 and 903 DF, p-value: 4.395e-05
anova(model4,new_model)
## Analysis of Variance Table
##
## Model 1: I(player_net_rating^(1/3)) ~ OREB * STL
## Model 2: I(player_net_rating^(1/3)) ~ log(AGE) + PTS + I(FG.^2) + I(X3P.^2) +
## I(FT.^2) + DREB + AST + OREB * STL + BLK
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 903 1898.0
## 2 895 1883.7 8 14.25 0.8464 0.5619
With only these three terms we get a large p-value from the ANOVA test, which has us preferring the smaller model. Our individual parameters also all have significant p-values. Our R-squared values aren’t particularly high, but that is the trade-off for the leaner model.
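As a sanity check, the F statistic in the ANOVA table can be reconstructed by hand from the residual sums of squares and degrees of freedom printed above:

```r
# Values read off the ANOVA table (rounded as printed there)
rss_small <- 1898.0; df_small <- 903   # OREB * STL model
rss_big   <- 1883.7; df_big   <- 895   # larger model
f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
f_stat                                 # ~0.85, matching the table's 0.8464 up to rounding
pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)  # ~0.56, the reported p-value
```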
hat_values <- hatvalues(model4)
high_leverage_points_indices <- which(hat_values > 2*length(coef(model4))/nrow(player_data)) # high leverage points
num_high_leverage_points <- sum(hat_values > 2*length(coef(model4))/nrow(player_data)) # use model4's coefficients, matching the indices above
cooks_D <- cooks.distance(model4)
influential_outlier_indices <- which(cooks_D > 4/nrow(player_data)) # influential points
num_influential_outliers <- sum(cooks_D > 4/nrow(player_data))
#print(influential_outlier_indices)
studentized_residuals <- rstudent(model4)
outlier_indices <- which(abs(studentized_residuals) > 3)
num_outliers <- sum(abs(studentized_residuals) > 3)
print(paste("Number of outliers:", num_outliers))
## [1] "Number of outliers: 2"
print(paste("Number of influential outliers:", num_influential_outliers))
## [1] "Number of influential outliers: 98"
print(paste("Number of high leverage points:", num_high_leverage_points))
## [1] "Number of high leverage points: 50"
Here we check our influential points, outliers, and high leverage points. This is how we knew it made sense to remove players with fewer than 100 minutes played: they were outliers in the dataset. For example, some players had a 0% field goal percentage, which makes sense because a player who doesn’t take many shots might miss the one or two they do take, which isn’t a representative sample of that player’s abilities.
model = model4
library(ggplot2)
residuals = resid(model)
predicted = predict(model)
#show the fitted vs redisuals
plot(fitted(model), resid(model), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, col = "darkorange", lwd = 2)
#show the Q-Q plot
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey")
qqline(resid(model), col = "dodgerblue", lwd = 2)
cooksd = cooks.distance(model)
plot(cooksd, pch="*", cex=2, main="Influence of Observations (Cook's distance)")
abline(h = 4*mean(cooksd, na.rm=T), col="red")
sum(cooks.distance(model) > 4 / length(cooks.distance(model)))
## [1] 36
We now check the assumptions of our smaller model, and they look even better than in our previous assumption checks.
We end up with a model indicating that steals and offensive rebounds are the best predictors of how a player contributes to their team’s success via their net rating. This makes quite a lot of sense: the two stats share a major commonality, in that both give their team an extra possession. An extra possession means a full opportunity to score again at any possible point value. Stats like the shot percentages are useful, but not as much. Think about it from an expected-value perspective: say the league-average field goal percentage is 50%. If a player gives their team an extra possession and they take a two-point shot, they gain 1 expected point. Compare that with no extra possession but a field goal percentage boosted from 50% to a DeAndre Jordan-like 76.3% (his 2020-21 mark): on a two-point shot that boost is worth an additional 0.526 expected points, less than the 1 point gained from the steal or offensive rebound.
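The back-of-envelope arithmetic in that comparison checks out (numbers as stated above; this is just the expected-value calculation, not a possession-level model):

```r
fg_league <- 0.50    # assumed league-average field goal percentage
fg_elite  <- 0.763   # DeAndre Jordan's 2020-21 mark, as cited above
extra_possession <- 2 * fg_league              # 1 expected point from one more two-point attempt
fg_boost <- 2 * (fg_elite - fg_league)         # 0.526 expected points from the efficiency jump
extra_possession > fg_boost                    # TRUE: the extra possession is worth more
```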
Should NBA teams only go after players with high steal and offensive rebound totals? There are a lot of other factors to consider in team building, and improving shot percentage, for example, still matters (especially at the margins in close games), just perhaps less than these two categories do in the long run. A better takeaway is that there is significant value to be gained from picking players who are better at steals and offensive rebounds than otherwise similar players at their position with similar stats.
Data from:
https://www.nba.com/stats/players/traditional?PerMode=Totals&Season=2020-21&SeasonType=Regular%20Season&dir=A&sort=FG_PCT
https://www.nba.com/stats/teams/traditional?SeasonType=Regular+Season&Season=2019-20
https://www.basketball-reference.com/
https://www.basketball-reference.com/leagues/NBA_2023_totals.html